Note: This is meant to be both a "capstone project proposal" and a "capstone milestone report".


In [1]:
from __future__ import division   # must precede the other imports in Python 2

import glob
import os

import numpy as np
import pandas as pd
import sqlalchemy as sqla
import statsmodels.api as sm
import matplotlib
import matplotlib.pyplot as plt
from sklearn import linear_model

Part 1: Data and goals

Data comes from two sources:

  • TLC data: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml (or, for newer data only, https://data.cityofnewyork.us/data?agency=Taxi+and+Limousine+Commission+%28TLC%29&cat=&type=new_view&browseSearch=&scope= )
    • This data includes gps coordinates for every single yellow cab trip in NYC, as well as fare data, the taxi medallion, and a bunch of other info
  • Census/American Community Survey (ACS) data:
    • Income, commute data, and much, much more.
    • Data from the census comes from an attempt to count every single US resident, but it mostly just records population, with only a small amount of other information.
    • Data from the ACS comes from a random sample and covers diverse demographic information, though some of it carries very large margins of error.
    • There are lots of sources; I've used https://nhgis.org/, since it lets you download just the columns you want rather than wasting hard drive space and time processing data you don't need.
    • The census supplies "shapefiles" of the different geographical areas they use at https://www.census.gov/geo/maps-data/data/tiger-line.html. I divided mine up by census tract, so I got census tract shapefiles.
    • Unfortunately, for some fields there is simply too much error for them to be useful. The ACS contains information on time spent commuting, but for car commuters in NYC the relative margin of error had a median around 50%, which made it simply not usable (a quick way to screen for this is sketched just after this list).
    • It will be interesting to see what effect race has on taxi dropoffs - does the common wisdom on taxis being prejudiced show up in dropoff data?
  • I have also taken a look at some Google Maps distance data, but it wasn't very predictive on its own. Maybe combining it with some of the demographic data will make it more useful.
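
Since the ACS is a sample survey, every estimate comes with a margin of error, and it's worth screening columns before using them. Below is a minimal sketch of the check I mean, assuming an NHGIS-style extract where each estimate column (e.g. 'XXXE001') has a matching margin-of-error column ('XXXM001'); the file and column names here are hypothetical placeholders, not data used later in this notebook.


In [ ]:
# Sketch: screen an ACS column by its relative margin of error.
# 'XXXE001'/'XXXM001' and the file name are hypothetical placeholders.
acs = pd.read_csv('nhgis_acs_nyc_tracts.csv')
relerr = (acs['XXXM001'] / acs['XXXE001']).replace([np.inf, -np.inf], np.nan)
print(relerr.median())   # a median near 0.5 means ~50% typical error: not usable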

My goal will be to predict the number of taxi dropoffs in an area based on demographic and other statistics about the area. Ideally, I would like to understand the demographics of taxicab users and how well taxicabs serve NYC residents. I may later expand my data to look at green "boro cabs" (which are only allowed to pick up passengers outside of the Manhattan central business district; data also available from the TLC) and Uber cars (https://github.com/fivethirtyeight/uber-tlc-foil-response), to compare their usage demographics.

If I do something with the Uber data, I plan to write up a story in the style of fivethirtyeight.com that I may submit to them, as they have done a series of stories on this Uber data.

I would also expect that a paper comparing taxi services would be useful to the city government, which wants to ensure good coverage across the city and across demographic groups.

Part 2: Taxi trip data processing

This part might be a little boring. If you get bored, just skip it.

I don't have a lot of RAM on my computer, so I can't hold a whole year's worth of individual trip data in memory (even a month is trouble). Instead, I read chunks of 300,000 trips at a time and group the dropoffs by their GPS coordinates, rounded to 4 decimal places (roughly 11 m of latitude) so that the grouping actually gives me some data reduction.


In [ ]:
path=os.path.expanduser("~/Documents/TaxiTripData/2013taxi_trip_data/")
files=glob.glob(path+'*.csv.zip.gz')

In [ ]:
chunk=300000
counts=None
for i in files:
    print(i)
    # columns 12 and 13 in the 2013 files are dropoff_longitude and dropoff_latitude
    iterate=pd.read_csv(i,usecols=[12,13],iterator=True,chunksize=chunk,
                        names=['dropoff_longitude','dropoff_latitude'],header=0)
    for test in iterate:
        # InNYCarea is a helper defined elsewhere in the notebook; it keeps only
        # dropoffs that fall inside a rough NYC bounding box
        test=test[InNYCarea(test)]
        test['dropoff_latitude']=test['dropoff_latitude'].round(decimals=4)
        test['dropoff_longitude']=test['dropoff_longitude'].round(decimals=4)
        chunkcounts=test.groupby(['dropoff_latitude','dropoff_longitude']).size()
        # accumulate counts across chunks and across files
        if counts is None:
            counts=chunkcounts
        else:
            counts=counts.add(chunkcounts,fill_value=0)

A note about this data: for some reason I couldn't find the NYC government page for the taxi data when I first went looking, so I instead used data from someone else's FOIL request from last year. That means this data is for 2013. It is relatively clean, but the month of September has something funky: it contains about twice as many trips as any other month, and many trips repeat data that should be unique. Essentially every trip shares its exact GPS coordinates, pickup time, or dropoff time with at least one other trip, but only about a quarter appear to be exact copies of other trips.
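
For reference, here is roughly the kind of check that flagged September. This is a sketch, not run here: the file-name pattern is a placeholder, and the column names may need adjusting to match the headers in the FOIL files.


In [ ]:
# Sketch: trips per month and the share of exactly duplicated rows.
# Loads only four columns to keep memory use down; the file pattern is a placeholder.
for f in sorted(glob.glob(path + 'trip_data_*.csv.zip.gz')):
    month = pd.read_csv(f, usecols=['pickup_datetime', 'dropoff_datetime',
                                    'dropoff_longitude', 'dropoff_latitude'])
    print('{}: {} trips, {:.0%} exact duplicate rows'.format(
        f, len(month), month.duplicated().mean()))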

It's possible that the later data is cleaner (NYC built an automated data visualizer for it, so it probably is), but I had already been working with this data, and the lack of one month probably won't make much difference. For the final analysis, I'll probably take a look at later data, or use it as a test set.

Now I want to sort this data by census tract. Python's GIS tooling seems pretty rudimentary, so I used PostGIS, a GIS extension for PostgreSQL, to match GPS coordinates to census tracts. Point-in-polygon matching at this scale is a hard problem, so I'm glad I can use somebody else's code for it.

I created a SQL database that contains the TIGER shapefiles of the NYC census tracts, indexed by the FIPS code, a numeric code assigned by the census to every geographical area they cover. To do this, I basically followed info from this PostGIS tutorial http://workshops.boundlessgeo.com/postgis-intro/index.html.
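
One extra step that helps a lot: a spatial (GiST) index on the tract geometries, so the point-in-polygon join below doesn't have to scan every polygon for every point. A minimal sketch, assuming the table and geometry column are named nyctracts and geom as in the query below (the index name itself is arbitrary):


In [ ]:
# Sketch: add a GiST index on the tract geometries to speed up ST_Contains.
engine = sqla.create_engine('postgresql://myuser:notyours@localhost:5432/TaxiData',echo=False)
engine.execute('CREATE INDEX nyctracts_geom_idx ON nyctracts USING GIST (geom);')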

Then, I can use the PostGIS ST_Contains function to match all the GPS coordinates to FIPS codes, like so:


In [ ]:
engine = sqla.create_engine('postgresql://myuser:notyours@localhost:5432/TaxiData',echo=False)

In [ ]:
counts.name='counts'   # name the series so the SQL column below is called 'counts'
counts.to_sql('abridged2013gpscounts',engine,if_exists='replace')

In [ ]:
fipscounts=pd.read_sql_query(
    "SELECT c.geoid10 AS fipscodes, p.counts "
    "FROM abridged2013gpscounts AS p, nyctracts AS c "
    "WHERE ST_Contains(c.geom, ST_SetSRID(ST_MakePoint(p.dropoff_longitude, p.dropoff_latitude), 4269))",
    engine)

This has the added benefit of doing some automatic data cleaning, since it throws out any points that don't lie within NYC (sorry, middle of the Gulf of Guinea, I don't think you are really the most popular spot in NYC).


In [ ]:
fips2013counts=fipscounts.groupby('fipscodes').sum()

Then I can add it to a big SQL table of all the data I'm using, with the very creative and descriptive name lotsofdata.


In [ ]:
full=pd.read_sql('lotsofdata',engine)
full.set_index('index',inplace=True)
full['abridged2013ycdropoffs']=fips2013counts['counts']
full.to_sql('lotsofdata',engine,if_exists='replace')

Part 3: Analysis

This is probably a little more interesting. At least it has pretty pictures.

I'm only using census tracts with a population of at least 1000. Most tracts below that threshold are parks or water, and the errors on their ACS data are generally very high. Also, taxi dropoffs in those areas are probably not representative of the resident population.


In [2]:
engine = sqla.create_engine('postgresql://postgres:postgres@localhost:5432/TaxiData',echo=False)
full=pd.read_sql_query('SELECT a.* FROM lotsofdata AS a WHERE a.totalpopulation>=1000',engine).set_index('fipscodes')

For this analysis, I'll be looking at data from three columns of the ACS, with per-capita dropoffs as my dependent variable, i.e. the dropoffs computed above divided by the population given by the census.

This analysis uses the 2010 census (the most recent) and the 2007-2011 5-year estimates from the American Community Survey. I figure the ACS window roughly splits the difference between my 2010 population data and the 2013 dropoff data.
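
For reference, the per-capita dropoff column used below (abridged2013ycdrpoffpc) is just the raw dropoff count divided by the tract's census population. The actual column was computed when building the lotsofdata table; the line below is only a sketch of that step, assuming the column names used in the SQL queries above.


In [ ]:
# Sketch: per-capita dropoffs = 2013 dropoff count / 2010 census population.
# Column names follow the SQL queries above; the real column already exists in lotsofdata.
full['abridged2013ycdrpoffpc'] = full['abridged2013ycdropoffs'] / full['totalpopulation']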

The census uses some funky abbreviations for its column headers, and I haven't changed them. Here's a description:

  • MOGE001 : Workers 16 years and over who did not work at home.
  • MOGE011 : Workers 16 years and over who did not work at home and commuted by car, truck, or van.
  • MRUE001 : Per Capita Income in the Past 12 Months (in 2011 Inflation-Adjusted Dollars).

I compute the ratio of non-driving commuters to all commuters, (MOGE001 - MOGE011)/MOGE001, labeled nondrivercommuterrat.


In [3]:
subset=full[['MOGE001','MOGE011','MRUE001','abridged2013ycdrpoffpc']].copy()   # .copy() avoids SettingWithCopyWarning
subset['nondrivercommuterrat']=(subset['MOGE001']-subset['MOGE011'])/subset['MOGE001']
subset.replace(np.inf,np.nan,inplace=True)   # tracts with zero commuters give inf
subset.dropna(inplace=True)
subset.head()


Out[3]:
             MOGE001  MOGE011  MRUE001  abridged2013ycdrpoffpc  nondrivercommuterrat
fipscodes
36005041300     3231      727    19759                0.031013              0.774992
36005046201    11185     4432    28336                0.021814              0.603755
36061014200     1955      550   153665               20.753026              0.718670
36081017000      864      356    23300                0.038462              0.587963
36081044302     1818      538    18240                0.079456              0.704070

In [40]:
# Plotting the data: per-capita dropoffs vs per-capita income, colored by the
# non-driver commuter ratio.
matplotlib.rc('legend', fontsize=10)

height=6
gr=(1+np.sqrt(5))/2          # golden ratio, used for the figure's aspect ratio
plt.figure(figsize=[height*gr,height])
plt.autoscale(tight=True)
plt.xlabel('Per-capita income')
plt.ylabel('2013 yellow cab dropoffs per-capita')
plt.title("Yellow cab dropoffs vs income of census tract")
points=plt.scatter(subset['MRUE001'],subset['abridged2013ycdrpoffpc'],alpha=0.5,label='Data',
                   c=subset['nondrivercommuterrat'],cmap=plt.cm.get_cmap('viridis'))
cbar=plt.colorbar(points)
cbar.set_label("non-driver commuter ratio")
plt.yscale('log')
plt.xscale('log')
plt.show()


There appears to be a roughly linear relationship between the log of dropoffs per-capita and the log of per-capita income, but it isn't that strong. There is a funky region that almost looks like it has a negative slope, but the coloring shows that this region tends to have fewer non-driver commuters.


In [39]:
# Same scatter as above, now highlighting tracts with a non-driver commuter ratio
# above 0.8; the remaining tracts are drawn as faint open circles.
comd=0.8
asub=subset[subset['nondrivercommuterrat']>comd]
bsub=subset[subset['nondrivercommuterrat']<comd]
xdata=asub['MRUE001'].as_matrix()
ydata=asub['abridged2013ycdrpoffpc'].as_matrix()

matplotlib.rc('legend', fontsize=10)

height=6
gr=(1+np.sqrt(5))/2
plt.figure(figsize=[height*gr,height])

plt.xlabel('Per-capita income')
plt.ylabel('2013 yellow cab dropoffs per-capita')
plt.title("Yellow cab dropoffs vs income of census tract")
plt.autoscale(enable=True, axis='both', tight=True)

plt.scatter(bsub['MRUE001'],bsub['abridged2013ycdrpoffpc'],facecolors='none',edgecolors='black',
            alpha=0.1,label='Data, non-driver commuter ratio $<$ 0.8')
points=plt.scatter(xdata,ydata,alpha=0.5,label='Data, non-driver commuter ratio $>$ 0.8',
                   c=asub['nondrivercommuterrat'],cmap=plt.cm.get_cmap('cool'))

cbar=plt.colorbar(points)
cbar.set_label("non-driver commuter ratio")
plt.yscale('log')
plt.xscale('log')
plt.legend(loc='upper left')
plt.show()


If we look only at the data with a high non-driver commuter ratio, the linear relationship is very clear. The funky corner consists entirely of tracts with lower non-driver commuter ratios.


In [6]:
# log-log OLS: regress log(per-capita dropoffs) on log(per-capita income),
# using only the tracts with a non-driver commuter ratio above 0.8
model=sm.OLS(np.log(asub['abridged2013ycdrpoffpc']),sm.add_constant(np.log(asub['MRUE001']),prepend=False))
fit=model.fit()
fit.summary()


Out[6]:
                             OLS Regression Results
==============================================================================
Dep. Variable:   abridged2013ycdrpoffpc   R-squared:                     0.689
Model:                              OLS   Adj. R-squared:                0.689
Method:                   Least Squares   F-statistic:                   1507.
Date:                  Mon, 30 Nov 2015   Prob (F-statistic):        1.17e-174
Time:                          19:23:06   Log-Likelihood:              -1086.7
No. Observations:                   682   AIC:                           2177.
Df Residuals:                       680   BIC:                           2187.
Df Model:                             1
Covariance Type:              nonrobust
==============================================================================
                 coef    std err          t      P>|t|     [95.0% Conf. Int.]
------------------------------------------------------------------------------
MRUE001        2.2805      0.059     38.820      0.000         2.165    2.396
const        -23.8229      0.605    -39.367      0.000       -25.011  -22.635
==============================================================================
Omnibus:                       50.788   Durbin-Watson:                  1.209
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              71.269
Skew:                           0.586   Prob(JB):                    3.34e-16
Kurtosis:                       4.065   Cond. No.                        138.

In [34]:
# Same plot, now with the log-log OLS fit overlaid on the high-ratio tracts.
comd=0.8
asub=subset[subset['nondrivercommuterrat']>comd]
bsub=subset[subset['nondrivercommuterrat']<comd]
xdata=asub['MRUE001'].as_matrix()
ydata=asub['abridged2013ycdrpoffpc'].as_matrix()

XX=np.linspace(xdata.min(),xdata.max(),num=100)

matplotlib.rc('legend', fontsize=10)

height=6
gr=(1+np.sqrt(5))/2
plt.figure(figsize=[height*gr,height])

plt.xlabel('Per-capita income')
plt.ylabel('2013 yellow cab dropoffs per-capita')
plt.title("Yellow cab dropoffs vs income of census tract")
plt.autoscale(enable=True, axis='y', tight=True)

points=plt.scatter(xdata,ydata,alpha=0.5,label='Data, non-driver commuter ratio $>$ 0.8',
                   c=asub['nondrivercommuterrat'],cmap=plt.cm.get_cmap('viridis'))
plt.scatter(bsub['MRUE001'],bsub['abridged2013ycdrpoffpc'],facecolors='none',edgecolors='black',
            alpha=0.1,label='Data, non-driver commuter ratio $<$ 0.8')

# back-transform the log-log fit into the original units for plotting
plt.plot(XX,(XX**fit.params['MRUE001'])*np.exp(fit.params['const']),color='red',alpha=0.5,lw=2,
         label='Fit, $\mathsf{R^2}=$ ' + str(round(fit.rsquared,2)))

cbar=plt.colorbar(points)
cbar.set_label("non-driver commuter ratio")
plt.yscale('log')
plt.xscale('log')
plt.legend(loc='upper left')
plt.show()


We've got a fit that explains most of the variance, with an $R^2$ of 0.689. I suspect that I won't be able to explain much more of this subset (non-driver commuter ratio > 0.8) of the data, but maybe I will.

However, there is still lots to do. I should probably do an error analysis on this data, to see whether there is really any more variance left to explain (much of it might just be random fluctuation over time). I also want to do some analysis to better understand the causal relationship between these two variables. Many of these dropoffs are not representative of the resident population at all; they are common destinations, e.g. Grand Central and Penn Station, or the area around Central Park. In that case, it is the high rents around these common destinations that drive the income relationship, not the rich being more likely to take cabs.

Of course, there is also the problem of the rest of the data that is left to explain. Using just per-capita income as a predictor, there doesn't seem to be much correlation for these tracts (see the blob in the plot below), and I will need to explore other variables. I might want to use some more advanced machine-learning techniques to figure out which data is useful; a sketch of one possible approach follows the plot.


In [36]:
# The remaining data: tracts with a non-driver commuter ratio below 0.8, with the
# high-ratio tracts drawn as faint open circles for comparison.
comd=0.8
asub=subset[subset['nondrivercommuterrat']<comd]
bsub=subset[subset['nondrivercommuterrat']>comd]
xdata=asub['MRUE001'].as_matrix()
ydata=asub['abridged2013ycdrpoffpc'].as_matrix()

matplotlib.rc('legend', fontsize=10)

height=6
gr=(1+np.sqrt(5))/2
plt.figure(figsize=[height*gr,height])

plt.xlabel('Per-capita income')
plt.ylabel('2013 yellow cab dropoffs per-capita')
plt.title("Yellow cab dropoffs vs income of census tract")
plt.autoscale(enable=True, axis='both', tight=True)

points=plt.scatter(xdata,ydata,alpha=0.5,label='Data, non-driver commuter ratio $<$ 0.8',
                   c=asub['nondrivercommuterrat'],cmap=plt.cm.get_cmap('viridis'))
plt.scatter(bsub['MRUE001'],bsub['abridged2013ycdrpoffpc'],facecolors='none',edgecolors='black',
            alpha=0.1,label='Data, non-driver commuter ratio $>$ 0.8')

cbar=plt.colorbar(points)
cbar.set_label("non-driver commuter ratio")
plt.yscale('log')
plt.xscale('log')
plt.legend(loc='upper left')
plt.show()
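

One possible way to sort through the other ACS variables is a regularized regression, which is part of why sklearn's linear_model is imported up top. Below is a minimal sketch of what I have in mind, not yet run: acs_candidate_cols is a placeholder for whichever ACS columns end up in lotsofdata, so only the three columns already used above are listed here.


In [ ]:
# Sketch: cross-validated lasso on standardized ACS columns, for the tracts the
# income fit doesn't explain. Nonzero coefficients suggest variables worth keeping.
from sklearn import preprocessing

acs_candidate_cols = ['MOGE001', 'MOGE011', 'MRUE001']   # placeholder list
low = subset[subset['nondrivercommuterrat'] < 0.8].dropna()
low = low[low['abridged2013ycdrpoffpc'] > 0]             # log() needs positive values

X = preprocessing.StandardScaler().fit_transform(low[acs_candidate_cols])
y = np.log(low['abridged2013ycdrpoffpc'])

lasso = linear_model.LassoCV(cv=5).fit(X, y)
for col, coef in zip(acs_candidate_cols, lasso.coef_):
    print('{}: {}'.format(col, round(coef, 3)))          # nonzero => potentially useful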


